Abstract

We  investigate  an  integrated  approach  to  fault  tolerance  and 
dynamic  power  management  in  real-time  embedded  systems.  Fault 
tolerance  is  achieved  via  checkpointing  and  power  management  is 
carried  out  using  dynamic  voltage  scaling  (DVS).  We  present 
feasibility-of-scheduling  tests  for  checkpointing  schemes  for  a 
constant  processor  speed  as  well  as  for  variable  processor  speeds. 
DVS  is  then  carried  out  on  the  basis  of  these  feasibility  analyses. 
Experimental  results  show  that  compared  to  fault-oblivious 
methods,  the  proposed  approach  significantly  reduces  power 
consumption  and  guarantees  timely  task  completion  in  the  presence 
of  faults. 

1.  Introduction 

Fault  tolerance  techniques  are  needed  to  ensure  the 
dependability  of  embedded  systems  that  operate  in  harsh 
environmental  conditions.  These  embedded  systems  also  operate 
under  severe  energy  limitations.  In  addition,  many  embedded 
systems  execute  real-time  applications  that  require  strict  adherence 
to  task  deadlines.  In  this  paper,  we  investigate  an  integrated 
approach  that  provides  fault  tolerance  and  dynamic  power 
management  (DPM)  in  hard  real-time  embedded  systems.  We 
extend  a  recent  energy-aware  adaptive  checkpointing  scheme  that 
considers a single task in a soft real-time system [1].

Dynamic  voltage  scaling  (DVS)  is  a  popular  technique  for 
reducing power consumption during system operation [2, 3]. Fault
tolerance is typically achieved in real-time systems through
checkpointing [4]. At each checkpoint, the system saves its state in
a  secure  device.  When  a  fault  is  detected,  the  system  rolls  back  to 
the  most  recent  checkpoint  and  resumes  normal  execution.  The 
checkpointing  interval,  i.e.,  duration  between  two  consecutive 
checkpoints,  must  be  carefully  chosen  to  balance  checkpointing 
cost  with  the  re-execution  time. 

DPM  and  fault  tolerance  for  embedded  real-time  systems  have 
largely  been  studied  as  separate  problems  in  the  literature.  DVS 
techniques  for  power  management  do  not  consider  fault  tolerance 
[2,  3],  and  checkpoint  placement  strategies  for  fault  tolerance  do 
not address DPM [5, 6]. It is only recently that an attempt has been
made to combine fault tolerance with DPM [1].

There  are  three  main  reasons  for  combining  DPM  with  fault 
tolerance in real-time embedded systems. First, increased die
temperatures due to higher processor speeds create thermal stresses
on  the  die  and  undermine  system  reliability.  In  order  to  mitigate 
reliability  problems  caused  by  high  die  temperatures,  we  can  either 
lower  energy  consumption  through  DPM  techniques  such  as  DVS, 
or  we  can  adopt  fault  tolerance  techniques  such  as  checkpointing. 
Better  still,  a  combination  of  DVS  and  checkpointing  can  be  used. 

The second reason is motivated by the need to meet the task
deadlines  in  real-time  systems.  If  faults  occur  frequently,  the 
processor  speed  can  be  scaled  up  dynamically  (within  limits 
imposed  by  higher  die  temperatures)  and  more  slack  can  be 
provided  to  the  task,  which  allows  more  time  for  rollback  recovery. 

The  third  motivation  arises  from  shrinking  process  technologies 
in  the  nanotechnology  realm.  Lower  processor  voltages  are  likely 
to  lead  to  lower  noise  margins  and  more  transient  faults,  caused  in 
part  by  single-event  upsets. 

We  first  present  feasibility  tests  for  fixed-priority  real-time 
systems  with  checkpointing  under  constant  processor  speed. 
Following  this,  we  extend  these  feasibility  tests  to  variable-speed 
processors. Based on the results of the feasibility analyses, an on-line
dynamic speed-scaling scheme is further developed to reduce
energy  during  task  execution.  The  proposed  approach  is  compared 
with  a  fault-oblivious  DVS  scheme  in  the  presence  of  faults. 

2.  Feasibility  Analysis  Under  Constant  Speed 

We are given a set $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of n periodic real-time
tasks, where task $\tau_i$ is modeled by a tuple $\tau_i = (T_i, D_i, E_i)$. The
elements of the tuple are defined as follows: $T_i$ is the period of $\tau_i$,
and $D_i$ is its deadline ($D_i \le T_i$); $E_i$ is the execution time of $\tau_i$ under
fault-free conditions. Let the checkpointing cost be C. We make the
following assumptions related to task execution and fault arrivals:

(i) The task set $\Gamma$ is scheduled using fixed-priority methods such as
the rate-monotonic scheme [7]; (ii) the task set $\Gamma$ is schedulable
under fault-free conditions; (iii) the priorities of the tasks are in
decreasing order of the index i, i.e., task $\tau_i$ has higher priority than
task $\tau_j$ if i < j; (iv) each instance of the task is released at the
beginning of the period; (v) the checkpointing intervals for a task
are equal; (vi) the times for rollback and state restoration are zero;
(vii) faults are detected as soon as they occur; and (viii) no faults
occur during checkpointing and rollback recovery.

In  [8],  a  feasibility  analysis  is  provided  under  the  assumption 
that  two  successive  faults  arrive  with  a  minimum  inter-arrival  time 
$T_F$. This is not practical for realistic applications, where the fault
occurrence  can  be  bursty  or  memoryless.  Therefore,  we  focus  here 
on  tolerating  up  to  a  given  number  of  faults  during  task  execution. 
No  additional  assumption  is  made  regarding  fault  arrivals. 

Since  the  task  set  is  periodic,  the  total  execution  time  can  be 
very  high  if  we  consider  a  large  number  of  periods.  We  therefore 
need to identify an appropriate k-fault-tolerant condition for a shorter
time duration. Here we provide two solutions corresponding to two
different  fault-tolerance  requirements.  One  is  to  tolerate  k  faults  for 
each  job,  termed  as  job-oriented  fault-tolerance;  the  other  is  to 
tolerate  k  faults  within  a  hyperperiod  (defined  as  the  least  common 
multiple  of  all  the  task  periods  [7]),  termed  as  hyperperiod-oriented 
fault-tolerance. 

We first consider the case of a single job. Suppose m
checkpoints are inserted equidistantly to tolerate k faults in one job.
The worst-case response time R for the job is composed of three
terms: the task execution time E, the checkpointing cost mC, and
the recovery cost kE/(m+1), i.e., $R = E + mC + kE/(m+1)$. To
satisfy the deadline constraint, we must have

$$E + mC + \frac{kE}{m+1} \le D.$$

Let $f(m) = E + mC + kE/(m+1) - D$. Setting $df/dm = C - kE/(m+1)^2 = 0$
shows that the minimum value of f(m) is obtained for $m = \sqrt{kE/C} - 1$.
Since m is a non-negative integer, we have
$m_0 = \lceil \max(\sqrt{kE/C} - 1, 0) \rceil$.

If $f(m_0) \le 0$, there exist equidistant checkpointing schemes for
k-fault-tolerance, and the response time is minimum when $m_0$
checkpoints are inserted. If $f(m_0) > 0$, then no equidistant
checkpointing scheme exists for tolerating up to k faults.
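
The single-job test above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original analysis; the function names are ours, and the rounding follows the ceiling used in the expression for $m_0$.

```python
import math

def optimal_checkpoints(k, E, C):
    """m0 = ceil(max(sqrt(k*E/C) - 1, 0)), as in the text."""
    return math.ceil(max(math.sqrt(k * E / C) - 1, 0))

def worst_case_response(k, E, C, m):
    """R = E + m*C + k*E/(m+1) for m equidistant checkpoints."""
    return E + m * C + k * E / (m + 1)

def k_fault_feasible(k, E, C, D):
    """Return (feasible, m0): feasible is True when f(m0) <= 0."""
    m0 = optimal_checkpoints(k, E, C)
    return worst_case_response(k, E, C, m0) <= D, m0
```

For instance, a job with E = 7, D = 18, C = 1 and k = 3 yields four checkpoints and a worst-case response time of 15.2, so the test passes.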

The feasibility analysis for more than one job is based on the
time-demand analysis for fixed-priority scheduling [7]. The steps in
the analysis are as follows:

(1) Compute the response time $R_i$ for $\tau_i$ according to the
equation

$$R_i = E_i + \sum_{h=1}^{i-1} \lceil R_i / T_h \rceil E_h.$$

Here $T_h$ and $E_h$ are the period and the execution time of a task
$\tau_h$ with higher priority than $\tau_i$. This equation can be solved by
forming a recurrence relation:

$$R_i^{(j+1)} = E_i + \sum_{h=1}^{i-1} \lceil R_i^{(j)} / T_h \rceil E_h. \tag{1}$$

(2) The iteration is terminated either when $R_i^{(j+1)} = R_i^{(j)}$ and
$R_i^{(j)} \le D_i$ for some j, or when $R_i^{(j+1)} > D_i$, whichever occurs
sooner. In the former case, $\tau_i$ is schedulable; in the latter case, $\tau_i$ is
not schedulable.

According to [7], the time complexity of the time-demand
analysis for each task is $O(nR)$, where R is the ratio of the largest
period to the smallest period.
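
As an illustration, the recurrence in Equation (1) and its two termination conditions can be written as the following Python sketch. The representation of a task as a tuple (T, D, E), listed in decreasing priority order, is an assumption made for this example.

```python
import math

def response_time(i, tasks):
    """Time-demand analysis for task i (0-based).

    tasks: list of (T, D, E) tuples in decreasing priority order.
    Returns (R, schedulable).
    """
    _, D_i, E_i = tasks[i]
    R = E_i  # initial guess: the task's own execution time
    while True:
        # Equation (1): own demand plus interference from higher-priority tasks.
        R_next = E_i + sum(math.ceil(R / T_h) * E_h for T_h, _, E_h in tasks[:i])
        if R_next > D_i:
            return R_next, False   # deadline exceeded
        if R_next == R:
            return R, True         # fixed point reached within the deadline
        R = R_next
```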

2.1  Job-oriented  fault-tolerance:  tolerating  k  faults  in  each  job 

In  this  case,  we  require  that  the  task  set  can  meet  the  deadline 
requirement  under  the  condition  that  at  most  k  faults  occur  during 
the  execution  of  each  job. 

Under the worst-case condition, the additional time due to
checkpointing and recovery should be incorporated. When there
are $m_i$ equidistant checkpoints for each instance of $\tau_i$, we have:

$$R_i = \left(E_i + m_i C + \frac{kE_i}{m_i+1}\right) + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil \left(E_h + m_h C + \frac{kE_h}{m_h+1}\right).$$

To minimize all response times, we must have
$m_i^* = \lceil \max(\sqrt{kE_i/C} - 1, 0) \rceil$ ($1 \le i \le n$). Then we can employ the
recurrence equation as follows:

$$R_i^{(j+1)} = \left(E_i + m_i^* C + \frac{kE_i}{m_i^*+1}\right) + \sum_{h=1}^{i-1} \left\lceil \frac{R_i^{(j)}}{T_h} \right\rceil \left(E_h + m_h^* C + \frac{kE_h}{m_h^*+1}\right).$$

When $R_i^{(j+1)} = R_i^{(j)}$ and $R_i^{(j)} \le D_i$ for some j, $\tau_i$ is
schedulable; when $R_i^{(j+1)} > D_i$, $\tau_i$ is not schedulable. The total time
complexity here is $O(nR)$, where R is the ratio of the largest
period to the smallest period.

Example 1: Consider a task set composed of two tasks: $\tau_1$ = (60,
18, 7) and $\tau_2$ = (80, 34, 8), and let k = 3, C = 1. Then $m_1 = 4$ and
$m_2 = 4$. After applying the recurrence equation, we get the
response times: $R_1$ = 15.2 < 18; $R_2$ = 33 < 34. Thus checkpointing is
feasible for this task set if up to three faults occur during each job.
Next we examine the case of k = 4. For this case $m_1 = 5$ and $m_2 = 5$.
The response times are: $R_1$ = 16.7 < 18 and $R_2$ = 35 > 34. As a
result, checkpointing is not feasible if up to four faults need to be
tolerated for each job.
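
The job-oriented test can be sketched as follows. The code is only an illustration under the assumptions of this section (tasks given as (T, D, E) tuples in decreasing priority order); the helper names are ours. It inflates each task's execution time by its checkpointing and recovery overhead and then runs the time-demand recurrence on the inflated values.

```python
import math

def inflated_cost(k, E, C):
    """Worst-case demand of one job: E + m*C + k*E/(m+1) with m = m_i^*."""
    m = math.ceil(max(math.sqrt(k * E / C) - 1, 0))
    return E + m * C + k * E / (m + 1)

def job_oriented_feasible(tasks, k, C):
    """True if every task meets its deadline with up to k faults per job."""
    costs = [inflated_cost(k, E, C) for _, _, E in tasks]
    for i, (_, D_i, _) in enumerate(tasks):
        R = costs[i]
        while True:
            R_next = costs[i] + sum(
                math.ceil(R / tasks[h][0]) * costs[h] for h in range(i))
            if R_next > D_i:
                return False
            if R_next == R:
                break
            R = R_next
    return True
```

Applied to the task set of Example 1, the sketch reports the set as feasible for k = 3 and infeasible for k = 4, in agreement with the conclusions above.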


2.2 Hyperperiod-oriented fault-tolerance: tolerating k faults in a hyperperiod

In [8], an algorithm is presented to determine the checkpointing
interval under the assumption that two successive faults arrive with
a minimum inter-arrival time $T_F$. Let $F_j$, $1 \le j \le i$, be the extra
computation time needed by $\tau_j$ if one fault occurs during its
execution. When there are $m_j$ equidistant checkpoints for $\tau_j$, the
response time $R_i$ for $\tau_i$ is expressed as follows in [8]:

$$R_i = (E_i + m_i C) + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil (E_h + m_h C) + \left\lceil \frac{R_i}{T_F} \right\rceil \max_{1 \le j \le i}\{F_j\},$$

where $F_j = E_j/(m_j + 1)$.

Checkpoint placement is examined starting from high-priority tasks to
low-priority tasks. For each task $\tau_i$, the algorithm tries to reduce the
response time by reducing the maximum additional computation
time, i.e., $\max_{1 \le j \le i}\{F_j\}$. The details in [8] are as follows:

(1) Initially $m_i = 0$ for $1 \le i \le n$.

(2) Starting from the highest-priority task $\tau_1$, calculate the
minimum number of checkpoints $m_1$ required to make it
schedulable.

(3) In decreasing order of task priorities, calculate the response
time $R_i$ of task $\tau_i$. If $R_i \le D_i$, move to the next task; otherwise $R_i$
needs to be reduced further. The only way to reduce $R_i$ is to add
more checkpoints to decrease the re-execution time caused by
faults, i.e., $F_j$ for $1 \le j \le i$. In fact, the parameter $\max_{1 \le j \le i}\{F_j\}$ is
relevant here and should be reduced. The task $\tau_j$ that contributes
the most to the task re-execution time is found and one more
checkpoint is added to $\tau_j$. Then $R_i$ is recalculated. This process is
repeated until either $R_i \le D_i$ or the deadline $D_i$ is exceeded.

While  the  schedulability  test  in  [8]  provides  useful  guidelines 
on  task  schedulability  in  the  presence  of  faults,  its  drawback  is  that 
two  key  issues  that  affect  schedulability  are  not  addressed. 

1. Checkpoints are added to the higher-priority tasks in certain
iterations in order to satisfy deadline constraints for all the tasks.
These higher-priority tasks, however, have met their deadlines in
earlier iterations. The addition of more checkpoints to them
inevitably changes their response times. As a result, it is necessary
to trace back to re-calculate their response times and adjust their
checkpoints. This issue has not been addressed in [8].

2. It is necessary to determine a bound on the number of
checkpoints beyond which the addition of checkpoints does not
improve schedulability. In [8], the schedulability test concludes that
$\tau_i$ is not schedulable once $R_i$ increases during the addition of
checkpoints. However, this does not always hold. We present a
counterexample below.

Example 2: Consider two tasks $\tau_1$ = (100, 18, 7.999) and $\tau_2$ =
(101, 21, 8), and let $T_F$ = 102, C = 0.1. We follow the steps from
[8] as shown below:

(1) Initially $m_1 = m_2 = 0$, and $F_1$ = 7.999, $F_2$ = 8;

(2) Next $\tau_1$ is examined: $R_1$ = 15.998 < 18. No checkpoints are
needed for $\tau_1$. Thus $m_1 = m_2 = 0$.

(3) Next $\tau_2$ is examined: $R_2$ = 23.999 > 21. Since $F_2 > F_1$, one
checkpoint is added to $\tau_2$, thus $m_1 = 0$ and $m_2 = 1$. Then $F_1$ = 7.999,
$F_2$ = 4 and $\max_{1 \le j \le 2}\{F_j\}$ = 7.999. We recalculate the response
time $R_2$ = 24.098 > 23.999. According to [8], $\tau_2$ is not schedulable.
However, this is not correct. We continue the above step and find
$F_1 > F_2$, then one more checkpoint is added to $\tau_1$; as a result $m_1 = 1$,
$m_2 = 1$. Then $F_1$ = 7.999/(1 + 1) = 3.9995, $F_2$ = 4, and
$\max_{1 \le j \le 2}\{F_j\}$ = 4. We recalculate the response time of $\tau_1$ and $\tau_2$:
$R_1$ = 12.0985 < 18 and $R_2$ = 20.199 < 21, which implies that both
tasks are schedulable.
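
The numbers in Example 2 can be checked with a short Python sketch of the response-time expression quoted from [8]. The function below is an illustration only; its name, the list-based representation of the checkpoint counts, and the `horizon` guard against a diverging iteration are assumptions of this example.

```python
import math

def response_time_tf(i, tasks, m, C, T_F, horizon=1e9):
    """Fixed point of R_i = (E_i + m_i*C) + sum_h ceil(R_i/T_h)*(E_h + m_h*C)
    + ceil(R_i/T_F) * max_j F_j, with F_j = E_j/(m_j + 1).

    tasks: list of (T, D, E) in decreasing priority order; m: checkpoint counts.
    No deadline check is made here; the caller compares the result with D_i.
    """
    own = tasks[i][2] + m[i] * C
    F_max = max(tasks[j][2] / (m[j] + 1) for j in range(i + 1))
    R = own
    while True:
        interference = sum(
            math.ceil(R / tasks[h][0]) * (tasks[h][2] + m[h] * C)
            for h in range(i))
        R_next = own + interference + math.ceil(R / T_F) * F_max
        if R_next == R:
            return R
        if R_next > horizon:
            raise ValueError("iteration did not converge")
        R = R_next
```

With `tasks = [(100, 18, 7.999), (101, 21, 8)]`, C = 0.1 and $T_F$ = 102, the checkpoint vectors (0, 0), (0, 1) and (1, 1) give the response times listed in the example.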

We  require  here  that  the  tasks  meet  their  deadlines  under  the 
condition  that  at  most  k  faults  occur  during  a  hyperperiod.  Based  on 
the  schedulability  test  in  [8],  we  solve  the  two  aforementioned 
problems  as  follows. 

The response time $R_i$ for $\tau_i$ is expressed as:

$$R_i = (E_i + m_i C) + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil (E_h + m_h C) + k \max_{1 \le j \le i}\{F_j\},$$

where $F_j = E_j/(m_j + 1)$.

The  first  problem  can  be  solved  using  a  recursive  method.  Any 
time  we  increase  the  number  of  checkpoints  for  a  task,  all  the 
lower-priority  tasks  need  to  be  re-examined.  We  solve  the  second 
problem  by  determining  a  bound  on  the  number  of  checkpoints 
such  that  if  the  task  set  cannot  be  made  schedulable  using  this 
number  of  checkpoints,  it  cannot  be  scheduled  by  adding  more 
checkpoints.  Both  the  checkpointing  cost  and  the  timing  constraints 
must  be  taken  into  account. 

(1)  Analysis  of  a  bound  based  on  checkpointing  tradeoffs 

The effect of adding more checkpoints is two-fold. First, it
increases the execution time due to the checkpoint cost, which runs
contrary to the goal of reducing the response time. On the other
hand, it decreases re-execution due to a fault, which helps in
reducing the response time. Suppose the task execution time is E
and m checkpoints have already been added. If another checkpoint
is now added, the reduction of re-execution time under the
k-fault-tolerance requirement is simply:

$$\frac{kE}{m+1} - \frac{kE}{m+2} = \frac{kE}{(m+1)(m+2)}.$$

We combine the two impacts of checkpointing on the response
time to define the tradeoff function tr(m) as:

$$tr(m) = C - \frac{kE}{(m+1)(m+2)}.$$

If $tr(m) < 0$, then adding one more checkpoint can potentially
reduce the response time; otherwise, it is not helpful, since the added
checkpointing cost is at least as large as the reduction in
re-execution time due to the k faults.

For each task $\tau_i$ with $m_i$ checkpoints, we can calculate the
tradeoff function $tr_i(m_i)$. Solving for $tr_i(m_i') = 0$, we get
$m_i' = (-3 + \sqrt{1 + 4kE_i/C})/2$ for $1 \le i \le n$. Since $m_i' \ge 0$, we
further express it as $m_i' = \max(\lfloor (-3 + \sqrt{1 + 4kE_i/C})/2 \rfloor, 0)$ for
$1 \le i \le n$. This gives an upper bound on the number of checkpoints,
which is based on the tradeoff function.

(2)  Analysis  of  a  bound  based  on  timing  constraints 

Under fault-free conditions, the response time $R_i^0$ for task $\tau_i$
can be easily obtained. After incorporating the checkpointing cost
and timing constraints, we have $R_i^0 + m_i C \le D_i$, which implies that
$m_i \le (D_i - R_i^0)/C$. Let $m_i'' = \lfloor (D_i - R_i^0)/C \rfloor$.

Combining the two bounds, we define
$m_i^* = \min(m_i', m_i'')$ ($1 \le i \le n$). Then $m_i^*$ is a tighter upper bound
on the number of checkpoints required to make $\tau_i$ schedulable.
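
The two bounds can be computed directly from their closed forms. The following Python sketch is illustrative; the function names are ours, and the fault-free response time $R_i^0$ is assumed to be supplied by the caller (for example, from the time-demand analysis).

```python
import math

def tradeoff(m, k, E, C):
    """tr(m) = C - k*E/((m+1)(m+2)): net effect of adding one more checkpoint."""
    return C - k * E / ((m + 1) * (m + 2))

def tradeoff_bound(k, E, C):
    """m' = max(floor((-3 + sqrt(1 + 4kE/C)) / 2), 0)."""
    return max(math.floor((-3 + math.sqrt(1 + 4 * k * E / C)) / 2), 0)

def timing_bound(D, R0, C):
    """m'' = floor((D - R0) / C), with R0 the fault-free response time."""
    return math.floor((D - R0) / C)

def checkpoint_bound(k, E, C, D, R0):
    """Combined bound: the smaller of the tradeoff and timing bounds."""
    return min(tradeoff_bound(k, E, C), timing_bound(D, R0, C))
```

For a task with E = 7.999, D = 18, C = 0.1, $R^0$ = 7.999 and k = 1, the tradeoff bound (7) is the binding one, since the timing bound alone would allow 100 checkpoints.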

A checkpointing algorithm ADV-CP for off-line feasibility
analysis is described in Figure 1, which takes as an input parameter
the real-time task set $\Gamma$. All tasks are initially set unschedulable.
The recursive checkpointing procedure CP(p, q) is described in
Figure 2, where p and q are the lowest and highest index for the
task subset under consideration.


The recursive execution of CP(p, q) takes $O(n^2 R \sum_{i=1}^{n} m_i^*)$
time. Let $M = \sum_{i=1}^{n} m_i^*$. Adding all the costs together, the total
complexity for the feasibility test and checkpointing procedure is
$O(n^2 R M)$, which is only quadratic in the number of tasks n.
Furthermore, we note that the complexity can be reduced if we can
make M as small as possible. That is why we combine both the
tradeoff function and timing constraints to obtain a relatively tight
bound for $m_i$.
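
To make the two fixes concrete, the following Python sketch shows one possible reading of the procedure: checkpoints are added to the task with the largest $F_j$, every task from that one downward is re-examined, and the search stops when the per-task bound is reached. It is not a transcription of ADV-CP or CP(p, q) from Figures 1 and 2; the function names, the iterative (rather than recursive) control flow, and the `bounds` argument are assumptions of this example.

```python
import math

def response_time(i, tasks, m, k, C):
    """R_i with the k * max_j F_j recovery term; stops once D_i is exceeded."""
    _, D_i, E_i = tasks[i]
    own = E_i + m[i] * C
    F_max = max(tasks[j][2] / (m[j] + 1) for j in range(i + 1))
    R = own
    while True:
        interference = sum(
            math.ceil(R / tasks[h][0]) * (tasks[h][2] + m[h] * C)
            for h in range(i))
        R_next = own + interference + k * F_max
        if R_next == R or R_next > D_i:
            return R_next
        R = R_next

def assign_checkpoints(tasks, k, C, bounds):
    """Return checkpoint counts that make all tasks schedulable, or None.

    tasks: (T, D, E) tuples in decreasing priority order.
    bounds: per-task upper bounds on the number of checkpoints.
    """
    m = [0] * len(tasks)
    i = 0
    while i < len(tasks):
        if response_time(i, tasks, m, k, C) <= tasks[i][1]:
            i += 1
            continue
        # Task among 0..i that currently contributes the largest F_j.
        j = max(range(i + 1), key=lambda x: tasks[x][2] / (m[x] + 1))
        if m[j] >= bounds[j]:
            return None        # bound reached: more checkpoints cannot help
        m[j] += 1
        i = j                  # re-examine task j and all lower-priority tasks
    return m
```

On the task set of Example 2 with k = 1 (for which the recovery term coincides with the one used in the example), the sketch ends with one checkpoint per task.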

3.  Feasibility  Analysis  with  DVS 

We are given a variable-speed processor, which is equipped
with l speeds $f_1, f_2, \ldots, f_l$. In addition, $f_i < f_j$ if i < j. Let c be the
number of clock cycles that a single checkpoint takes. We are also
given a set $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$ of n periodic real-time tasks, where
task $\tau_i$ is modeled by a tuple $\tau_i = (T_i, D_i, E_i)$. The elements of the
tuple are defined as follows: $T_i$ is the period of $\tau_i$ and $D_i$ is its
deadline ($D_i \le T_i$); $E_i$ is the number of computation cycles of $\tau_i$
under fault-free conditions.

In addition to the assumptions in Section 2, we assume the task
set $\Gamma$ is schedulable under fault-free conditions at the lowest speed.
For  the  sake  of  simplicity  of  presentation,  we  also  assume  without 
loss  of  generality  that  speed  switching  does  not  incur  extra  cost  in 
terms  of  time  and  energy. 

We note that if supply voltage $V_{dd}$ is used for a task with N
single-cycle instructions, the energy consumption can be expressed
as ($\alpha$ is a constant):

$$Eng(N) = \alpha N V_{dd}^2. \tag{2}$$

We also note that the processor clock frequency f can be
expressed in terms of the supply voltage $V_{dd}$ and threshold voltage
$V_t$ as $f = \beta (V_{dd} - V_t)^2 / V_{dd}$, where $\beta$ is a constant.

From above, we obtain $V_{dd}$ as a function of f:

$$V_{dd}(f) = \left(V_t + \frac{f}{2\beta}\right) + \sqrt{\left(V_t + \frac{f}{2\beta}\right)^2 - V_t^2}. \tag{3}$$

According to Equation (2), energy consumption is a function of
N and f: $Eng(N, f) = \alpha N V_{dd}^2(f)$, where $V_{dd}(f)$ is expressed in
Equation (3). Here we assume $V_t = 0$ without loss of generality.
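
Equations (2) and (3) translate directly into code. The sketch below is illustrative; the default values $\alpha = \beta = 1$ are assumptions chosen for normalized speeds and are not taken from the text.

```python
import math

def vdd(f, v_t=0.0, beta=1.0):
    """Supply voltage needed for clock frequency f, Equation (3)."""
    a = v_t + f / (2 * beta)
    return a + math.sqrt(a * a - v_t * v_t)

def energy(n_cycles, f, alpha=1.0, v_t=0.0, beta=1.0):
    """Eng(N, f) = alpha * N * Vdd(f)^2, Equation (2) combined with (3)."""
    return alpha * n_cycles * vdd(f, v_t, beta) ** 2
```

With $V_t = 0$ the voltage reduces to $f/\beta$, so the energy of a fixed number of cycles grows quadratically with the speed.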

In our proposed scheme, speed scaling can be done for a
particular application, i.e., all tasks for the application are assigned
the same speed, or at the task level, i.e., different tasks can be
assigned different speeds. Speed scaling can also be carried out at
the job level, i.e., different jobs for a task can have different speeds.
Let $s(\tau_i): \tau_i \to f_j$ ($1 \le i \le n$, $1 \le j \le l$) denote the speed scaling
function, which maps a task $\tau_i$ to a speed $f_j$.

Our  aim  is  to  meet  task  deadlines  deterministically,  even 
though  k  faults  occur,  while  minimizing  energy  consumption.  First, 
we  need  to  identify  appropriate  time  duration  to  evaluate  the  energy 
consumption.  We  consider  the  hyperperiod  as  the  time  duration. 
Second,  the  criterion  of  minimizing  energy  consumption  needs  to 
be  clarified.  Based  on  the  application  requirement,  we  can  choose 
either  a  best-case  or  a  worst-case  energy  consumption  value.  By 
best-case,  we  refer  to  the  results  obtained  under  the  fault-free 
condition,  while  worst-case  refers  to  the  results  obtained  when  all  k 
faults  occur.  In  our  work,  we  focus  on  minimizing  energy 
consumption  under  the  worst-case  condition  during  a  hyperperiod. 
Let the hyperperiod be denoted by H and the number of checkpoints
for $\tau_i$ be denoted by $m_i$; the total energy consumption during one
hyperperiod is expressed as:

$$Total\_eng = \sum_{i=1}^{n} \frac{H}{T_i} \, Eng\left(E_i + m_i c + \frac{kE_i}{m_i + 1},\ s(\tau_i)\right). \tag{4}$$
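
Equation (4) can be evaluated with the short Python sketch below. It assumes integer periods (so that the hyperperiod can be computed with `math.lcm`, available from Python 3.9), $V_t = 0$ and $\alpha = \beta = 1$, so that $Eng(N, f) = N f^2$; the function name is ours.

```python
import math

def total_energy(tasks, m, speeds, k, c):
    """Worst-case energy over one hyperperiod, Equation (4).

    tasks: (T, D, E) tuples with integer periods T and E in cycles.
    m: checkpoints per task; speeds: normalized speed per task.
    Assumes Eng(N, f) = N * f**2 (V_t = 0, alpha = beta = 1).
    """
    H = math.lcm(*(T for T, _, _ in tasks))
    total = 0.0
    for (T, _, E), m_i, s in zip(tasks, m, speeds):
        cycles = E + m_i * c + k * E / (m_i + 1)   # worst case for one job
        total += (H // T) * cycles * s ** 2        # H/T jobs per hyperperiod
    return total
```

Setting every $m_i$ to zero reduces the per-job cycle count to $(k + 1)E_i$, which corresponds to recovering from each fault by re-executing the whole job.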

Figure 2: Recursive checkpointing procedure.


The  off-line  feasibility  analysis  with  DVS  provides  two 
important  pieces  of  information:  first,  it  provides  the  feasibility 
analysis  under  the  worst-case  scenario;  second,  it  provides  static 
results  such  as  speed  assignment  and  checkpoint  interval,  which 
can  be  further  used  for  on-line  adjustment  during  task  execution. 


3.1  Job-oriented  fault-tolerance  with  DVS 

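The worst-case response time for task $\tau_i$ can be expressed as:

$$R_i = \frac{E_i + m_i c + kE_i/(m_i+1)}{s(\tau_i)} + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil \frac{E_h + m_h c + kE_h/(m_h+1)}{s(\tau_h)}. \tag{5}$$

To minimize all response times, we must have
$m_i^* = \lceil \max(\sqrt{kE_i/c} - 1, 0) \rceil$ ($1 \le i \le n$). Then we can employ the
recurrence equation as follows:

$$R_i^{(j+1)} = \frac{E_i + m_i^* c + kE_i/(m_i^*+1)}{s(\tau_i)} + \sum_{h=1}^{i-1} \left\lceil \frac{R_i^{(j)}}{T_h} \right\rceil \frac{E_h + m_h^* c + kE_h/(m_h^*+1)}{s(\tau_h)}.$$

If $R_i^{(j+1)} = R_i^{(j)}$ and $R_i^{(j)} \le D_i$ for some j, $\tau_i$ is schedulable; if
$R_i^{(j+1)} > D_i$, $\tau_i$ is not schedulable.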

Since  the  optimal  number  of  checkpoints  is  fixed  a  priori  for 
each  task,  we  need  to  choose  appropriate  processor  speeds  to 
satisfy  the  deadline  constraint  for  each  task. 

(1) Application-level speed scaling: all tasks have the same speed.
Here all tasks have the same speed f and $s(\tau_1) = s(\tau_2) = \ldots = s(\tau_n) = f$,
where $f \in \{f_1, f_2, \ldots, f_l\}$. Equation (5) is simplified as:

$$R_i = \frac{E_i + m_i c + kE_i/(m_i+1)}{f} + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil \frac{E_h + m_h c + kE_h/(m_h+1)}{f}.$$

The iterative method described in Section 2.1 can be used here
to determine f. To examine the feasibility for each task, all
possible speeds have to be tested. There are l possibilities in total.
The lowest speed that satisfies the timing constraints is selected to
minimize energy consumption.

(2) Task-level speed scaling: different tasks can have different
speeds. To obtain an optimal solution, we use an exhaustive
method. Since each task can be run at l speeds, there are $l^n$ possible
speed combinations for n tasks. For each speed combination, the
feasibility test is performed according to Equation (5). Meanwhile,
the energy consumption is calculated from Equation (4). The speed
combination that satisfies the timing constraints with the minimum
energy consumption is chosen as the optimal solution.
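
The exhaustive task-level search can be sketched as follows. The code is a minimal illustration, assuming integer periods, normalized speeds, and $Eng(N, f) = N f^2$ ($V_t = 0$, $\alpha = \beta = 1$); the function names are ours, and no attempt is made to reproduce the experimental setup of Section 4.

```python
import itertools
import math

def job_cycles(E, k, c):
    """Worst-case cycles of one job with m_i^* checkpoints."""
    m = math.ceil(max(math.sqrt(k * E / c) - 1, 0))
    return E + m * c + k * E / (m + 1)

def feasible(tasks, cycles, speeds):
    """Equation (5) solved by iteration for every task."""
    for i, (_, D_i, _) in enumerate(tasks):
        own = cycles[i] / speeds[i]
        R = own
        while True:
            R_next = own + sum(
                math.ceil(R / tasks[h][0]) * cycles[h] / speeds[h]
                for h in range(i))
            if R_next > D_i:
                return False
            if R_next == R:
                break
            R = R_next
    return True

def best_task_level_speeds(tasks, levels, k, c):
    """Try all l**n speed combinations; return (energy, speeds) or None."""
    H = math.lcm(*(T for T, _, _ in tasks))
    cycles = [job_cycles(E, k, c) for _, _, E in tasks]
    best = None
    for speeds in itertools.product(levels, repeat=len(tasks)):
        if not feasible(tasks, cycles, speeds):
            continue
        eng = sum((H // T) * n * s ** 2
                  for (T, _, _), n, s in zip(tasks, cycles, speeds))
        if best is None or eng < best[0]:
            best = (eng, speeds)
    return best
```

Because the search space grows as $l^n$, this approach is only practical for small task sets and few speed levels, which is the setting considered here.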

3.2  Hyperperiod-oriented  fault-tolerance  with  DVS 

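The worst-case response time for task $\tau_i$ can be expressed as:

$$R_i = \frac{E_i + m_i c}{s(\tau_i)} + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil \frac{E_h + m_h c}{s(\tau_h)} + k \max_{1 \le j \le i}\{F_j\}, \tag{6}$$

where $F_j = E_j / [s(\tau_j)(m_j + 1)]$.

(1) Application-level speed scaling: all tasks have the same speed.
Here all tasks have the same speed f and $s(\tau_1) = s(\tau_2) = \ldots = s(\tau_n) = f$,
where $f \in \{f_1, f_2, \ldots, f_l\}$. Equation (6) is simplified as:

$$R_i = \frac{E_i + m_i c}{f} + \sum_{h=1}^{i-1} \left\lceil \frac{R_i}{T_h} \right\rceil \frac{E_h + m_h c}{f} + k \max_{1 \le j \le i}\{F_j\},$$

where $F_j = E_j / [f(m_j + 1)]$.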

In  contrast  to  (1)  in  Section  3.1,  we  first  fix  the  speed  instead  of 
the number of checkpoints. For each given speed f, we examine
the  feasibility  of  the  task  set  using  the  method  in  Section  2.2.  If  it  is 
schedulable,  the  corresponding  number  of  checkpoints  for  each  task 
can  be  obtained.  The  energy  consumption  is  calculated  from 
Equation  (4).  The  lowest  speed  that  satisfies  the  timing  constraints 
is  selected  to  minimize  energy  consumption. 

(2)  Task-level  speed  scaling:  different  tasks  can  have  different 
speeds.  To  obtain  an  optimal  solution,  we  use  an  exhaustive 
method. Since each task can be run at l speeds, there are $l^n$ possible
speed  combinations  for  n  tasks.  For  each  speed  combination,  the 
feasibility  test  is  performed  according  to  Equation  (6).  The  method 
in  Section  2.2  is  employed  and  the  corresponding  number  of 
checkpoints  is  obtained.  Meanwhile,  the  energy  consumption  is 
calculated  from  Equation  (4).  The  speed  combination  that  satisfies 
the  timing  constraints  with  the  minimum  energy  consumption  is 
chosen  as  the  optimal  solution. 

3.3  Job-level  on-line  speed  scaling 

As  discussed  in  Sections  3.1  and  3.2,  the  speed  assignment  and 
the  checkpointing  interval  are  determined  by  the  off-line  feasibility 
analysis.  A  static  sequence  of  jobs  is  obtained  and  their  timing 
parameters  such  as  release  times  and  execution  times  are  known  a 
priori  under  the  worst  case.  However,  if  only  such  static  measures 
are  used  during  run-time,  it  will  not  be  possible  to  make  use  of  idle 
intervals.  Clearly,  further  energy  saving  is  possible  through 
additional  on-line  speed  scaling. 

The  on-line  speed  scaling  procedure,  done  at  the  job-level,  is 
adaptive  with  respect  to  fault  occurrence.  It  makes  use  of  a  simple 
run-time  adaptation  mechanism.  The  key  features  are: 

•  Once  a  job  completes,  the  release  time  of  the  next  job  is  adjusted 
dynamically  during  run-time. 

•  The  processor  is  run  at  an  appropriate  speed  such  that  either  the 
current  job  completes  before  its  deadline,  or  before  the  static 
release  time  of  the  next  job,  whichever  is  sooner. 
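
The second feature can be illustrated with a small Python sketch. The text does not spell out the run-time mechanism in detail, so the function below is only one hypothetical realization: the function name, its arguments, and the choice of the lowest sufficient discrete speed are all assumptions of this example.

```python
def pick_online_speed(remaining_cycles, now, deadline, next_static_release, levels):
    """Lowest speed that finishes the remaining worst-case cycles in time.

    The time budget ends at the job's deadline or at the static release
    time of the next job, whichever is sooner. Returns None if even the
    highest available speed is too slow.
    """
    budget = min(deadline, next_static_release) - now
    if budget <= 0:
        return None
    for f in sorted(levels):
        if remaining_cycles / f <= budget:
            return f
    return None
```

When a job finishes early, the next job starts with a larger budget and can therefore be assigned a lower speed than the one fixed by the off-line analysis.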

4.  Experimental  Results 

In this section, we compare the performance of our energy-
aware fault-tolerance scheme with the DVS technique proposed in
[3], referred to as VSLP. Our goal here is to highlight the impact
of fault occurrences on a fault-oblivious DVS scheme.

We  use  the  following  notation  to  refer  to  the  various  types  of 
schemes:  (1)  JFTA:  job-oriented  fault  tolerance  with  application- 
level  speed  scaling;  (2)  JFTT:  job-oriented  fault  tolerance  with 
task-level  speed  scaling;  (3)  HFTA:  hyperperiod-oriented  fault 
tolerance  with  application-level  speed  scaling;  (4)  HFTT: 
hyperperiod-oriented  fault-tolerance  with  task-level  speed  scaling. 

Since the VSLP scheme as presented in [3] does not provide
fault tolerance, we assume that it simply re-executes a job when a
fault occurs. Furthermore, since JFTA is a special case of JFTT and
HFTA is a special case of HFTT, we compare VSLP with the JFTT
and  HFTT  schemes.  For  both  cases,  we  first  show  that  JFTT  and 
HFTT  can  schedule  task  sets  even  when  VSLP  cannot  do  so;  we 
then  show  that  these  schemes  can  save  more  energy  via 
checkpointing  in  the  presence  of  faults. 

4.1  JFTT  vs.  VSLP 

As  pointed  out  in  [8],  a  system  of  periodic  preemptable  tasks, 
each  of  whose  relative  deadline  D  is  equal  to  its  period  T ,  is 
schedulable  on  one  processor  according  to  the  rate  monotonic 
algorithm  if  and  only  if  its  total  task  utilization  is  equal  to  or  less 
than  1.  In  the  presence  of  faults,  since  the  re-execution  takes  extra 
time  and  the  total  task  utilization  will  be  increased  accordingly,  a 
task  set  that  is  schedulable  under  fault-free  conditions  may  no 
longer  be  schedulable.  Here  we  construct  a  task  set  whose  total 
utilization  is  greater  than  1  for  VSLP  under  faulty  conditions  even 
though  the  entire  task  set  is  executed  with  the  highest  speed,  and 
show  that  this  task  set  is  schedulable  using  JFTT. 

Suppose we are given three tasks $\tau_1$ = (12000, 12000, 2200), $\tau_2$
= (18000, 18000, 3000) and $\tau_3$ = (24000, 24000, 4000), and three
normalized processor speeds 1.0, 0.8 and 0.6. Let a single
checkpoint take c = 50 cycles. If only k = 1 fault occurs during each
job, the total utilization for VSLP under the highest speed is found
to be 1.033. This implies that VSLP cannot schedule the task set
when one or more faults occur during each job. However, the
experiments show that JFTT can tolerate up to 6 faults during each
job. The speed assignment ($s_1$, $s_2$, $s_3$) and number of checkpoints
($m_1$, $m_2$, $m_3$) for $\tau_1$, $\tau_2$ and $\tau_3$, and total energy consumption are
shown in Table 1.
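
The utilization figure quoted above follows from the re-execution assumption: with k faults per job, each job of VSLP executes up to k + 1 times. A minimal Python check, with a function name of our choosing, is:

```python
def reexecution_utilization(tasks, k, speed=1.0):
    """Total utilization when every job is re-executed in full after each
    of k faults; tasks are (T, D, E) tuples with E in cycles."""
    return sum((k + 1) * E / (speed * T) for T, _, E in tasks)
```

For the three tasks above, `reexecution_utilization(tasks, 1)` evaluates to about 1.033 at the highest speed, compared with about 0.517 in the fault-free case.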

Next we show that JFTT saves more energy than VSLP in the
presence of faults when both schemes are feasible. Consider three
tasks $\tau_1$ = (12000, 10000, 500), $\tau_2$ = (18000, 16000, 1000) and $\tau_3$ =
(24000, 22000, 2000), and three normalized processor speeds 1.0,
0.8 and 0.6. Let a single checkpoint take c = 50 cycles. The energy
saving for JFTT over VSLP is shown in Table 2. Compared to
VSLP, JFTT can save up to 75% energy in the presence of faults.

To  demonstrate  the  effect  of  checkpointing  cost,  we  fix  the 
value  of  k  and  change  the  value  of  c  for  the  same  task  set  used  in 
Table  2.  The  results  are  shown  in  Table  3. 

4.2  HFTT  vs.  VSLP 

We now show that HFTT can schedule task sets in the presence
of faults, even when VSLP fails to do so. Suppose we are given
three tasks $\tau_1$ = (12000, 12000, 2200), $\tau_2$ = (18000, 18000, 3000)
and $\tau_3$ = (24000, 24000, 4000), and three normalized processor
speeds 1.0, 0.8 and 0.6. Let a single checkpoint take c = 50 cycles.
As indicated in Section 4.1, VSLP cannot schedule this task set
when one or more faults occur during each job. Here although we
examine the fault occurrence in one hyperperiod, the WCET value
of each task for VSLP remains the same as that in Section 4.1. As a
result, VSLP still cannot schedule this task set if there are any fault
occurrences. On the other hand, HFTT can tolerate more than 10
faults during a hyperperiod. The speed assignment ($s_1$, $s_2$, $s_3$) and
number of checkpoints ($m_1$, $m_2$, $m_3$) for $\tau_1$, $\tau_2$ and $\tau_3$, and total
energy consumption are shown in Table 4.

Next we show that HFTT saves more energy than VSLP in the
presence of faults when both schemes are feasible. Consider three
tasks $\tau_1$ = (12000, 10000, 500), $\tau_2$ = (18000, 16000, 1000) and $\tau_3$ =
(24000, 22000, 2000), and three normalized processor speeds 1.0,
0.8 and 0.6. Let a single checkpoint take c = 100 cycles. The energy
saving for HFTT over VSLP is demonstrated in Table 5.


k    JFTT (s1, s2, s3)    JFTT (m1, m2, m3)    JFTT Energy    VSLP
1    (0.8, 0.8, 0.8)      (7, 8, 9)            29717          Infeasible
3    (0.8, 1.0, 0.8)      (12, 14, 16)         40473          Infeasible
6    (1.0, 1.0, 1.0)      (17, 19, 22)         60530          Infeasible

Table 1: JFTT vs. VSLP (Part 1).


k    Engy_JFTT    Engy_VSLP    Engy_JFTT / Engy_VSLP
1    6522         9360         0.70
2    7362         14040        0.52
3    7981         33280        0.24

Table 2: JFTT vs. VSLP (Part 2).


c      Engy_JFTT    Engy_VSLP    Engy_JFTT / Engy_VSLP
50     7981         33280        0.24
150    10206        33280        0.31
250    11844        33280        0.36

Table 3: JFTT vs. VSLP (Part 3).


k     HFTT (s1, s2, s3)    HFTT (m1, m2, m3)    HFTT Energy    VSLP
1     (0.6, 0.8, 0.8)      (3, 3, 4)            21716          Infeasible
4     (0.8, 1.0, 0.8)      (5, 6, 10)           32187          Infeasible
10    (1.0, 1.0, 1.0)      (10, 14, 19)         47850          Infeasible

Table 4: HFTT vs. VSLP (Part 1).


k    Engy_HFTT    Engy_VSLP    Engy_HFTT / Engy_VSLP
1    5400         9360         0.58
2    5508         14040        0.39
3    5760         33280        0.17

Table 5: HFTT vs. VSLP (Part 2).


5.  Conclusions 

We  have  shown  how  dynamic  adaptation  for  fault  tolerance  and 
power  management  can  be  carried  out  in  embedded  systems.  Fault 
tolerance  is  achieved  via  checkpointing  and  power  management  is 
carried  out  using  DVS.  We  have  presented  feasibility-of-scheduling 
tests  for  checkpointing  schemes  under  both  constant  processor 
speed  and  variable  processor  speed.  Two  feasibility  tests  have  been 
developed  for  application-level  and  task-level  speed  scaling, 
respectively.  Based  on  the  results  of  the  feasibility  analyses,  on-line 
dynamic  speed  scaling  can  be  employed  to  further  reduce  energy. 